The goal of this part is a rapid overview of the main tools of data science: importing, tidying, transforming, and visualizing data.
# tidyverse packages
# install.packages('tidyverse')
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
The Conflicts section of this message tells you which functions from the tidyverse mask functions with the same name in base R or other loaded packages.
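When two packages export a function with the same name, the `package::function()` form makes the choice explicit. The following is a minimal sketch; the vector `x` and the data frame `df` are made up for illustration.

```r
library(dplyr)

x <- c(1, 2, 3, 4, 5)

# stats::filter() applies a linear filter; here a 3-point moving average
moving_avg <- stats::filter(x, rep(1 / 3, 3))

# dplyr::filter() keeps the rows of a data frame that satisfy a condition
df <- data.frame(x = x)
big_rows <- dplyr::filter(df, x > 3)

moving_avg
big_rows
```

After `library(dplyr)` or `library(tidyverse)`, a bare `filter()` call refers to the dplyr version, so the `stats::` prefix is only needed to reach the masked function.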
# install.packages('palmerpenguins')
# install.packages('ggthemes')
library(palmerpenguins)
library(ggthemes)
We use the palmerpenguins package, which includes the penguins
dataset, and the ggthemes package, which offers a colorblind safe color
palette.
Do penguins with longer flippers weigh more or less than penguins with shorter flippers? What does the relationship between flipper length and body mass look like? Is it positive? Negative? Linear? Nonlinear? Does the relationship vary by the species of the penguins? How about by the island where the penguin lives?
The tidyverse uses special data frames called tibbles.
penguins
glimpse(penguins) # similar to str(penguins)
## Rows: 344
## Columns: 8
## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
## $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
## $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
## $ sex <fct> male, female, female, NA, female, male, female, male…
## $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
Our ultimate goal is to create a visualization displaying the relationship between the flipper lengths and body masses of these penguins, taking into consideration the species of the penguin.
In ggplot2, we begin a plot with the function
ggplot(). It defines a plot object that you then add layers
to.
The arguments are:
- data: the dataset to use in the graph
- mapping: defines how variables in our dataset are mapped to visual properties (aesthetics) of our plot
ggplot(data = penguins)
This creates an empty graph that is primed to display the data. We can think of it as an empty canvas onto which we’ll paint the remaining layers of our plot.
ggplot(
data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g)
)
The mapping argument is always defined in the
aes() function, and the x and y
arguments of aes() specify which variables to map to the x
and y axes.
We need to define a geom: the geometrical object that a plot uses to represent data.
ggplot(
data = penguins,
mapping = aes(x = flipper_length_mm, y = body_mass_g)
) +
geom_point()
## Warning: Removed 2 rows containing missing values (`geom_point()`).
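Warning message: two penguins in the dataset have missing body mass and/or flipper length values, so they cannot be drawn. ggplot2 subscribes to the philosophy that missing values should never silently go missing.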
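The warning can be checked against the data and, if desired, handled explicitly. This is a small sketch of two common options, not the only way to do it; the object names `penguins_complete`, `p`, and `p_quiet` are chosen here for illustration.

```r
library(dplyr)
library(ggplot2)
library(palmerpenguins)

# Count the rows where either plotted variable is missing
n_missing <- penguins |>
  filter(is.na(flipper_length_mm) | is.na(body_mass_g)) |>
  nrow()

# Option 1: drop those rows explicitly before plotting
penguins_complete <- penguins |>
  filter(!is.na(flipper_length_mm), !is.na(body_mass_g))

p <- ggplot(penguins_complete, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point()

# Option 2: keep the data as is and silence the warning in the layer
p_quiet <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(na.rm = TRUE)
```

Option 1 makes the removal visible in the code, which is usually easier to audit later than a silenced warning.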
It is always a good idea to be skeptical of any apparent relationship
between two variables and ask if there may be other variables that
explain or change the nature of this apparent relationship.
For example, does the relationship between flipper length and body mass
differ by species?
ggplot(
data = penguins,
mapping = aes(x = flipper_length_mm,
y = body_mass_g,
color = species)
) +
geom_point()
## Warning: Removed 2 rows containing missing values (`geom_point()`).
Scaling: when a categorical variable is mapped to an aesthetic, ggplot2 automatically assigns a unique value of the aesthetic (here, a unique color) to each unique level of the variable (here, each of the three species). ggplot2 also adds a legend that explains which values correspond to which levels.
Let’s add one more layer: a smooth curve displaying the relationship between body mass and flipper length.
ggplot(
data = penguins,
mapping = aes(x = flipper_length_mm,
y = body_mass_g,
color = species)
) +
geom_point() +
geom_smooth(method = 'lm')
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values (`geom_point()`).
When aesthetic mappings are defined in ggplot(), at the
global level, they are passed down to each of the subsequent geom layers
of the plot.
However, each geom function in ggplot2 can also take a
mapping argument, which allows for aesthetic mappings at
the local level.
ggplot(
data = penguins,
mapping = aes(x = flipper_length_mm,
y = body_mass_g)
) +
geom_point(mapping = aes(color = species)) +
geom_smooth(method = 'lm')
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values (`geom_point()`).
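One way to see the difference between global and local mappings is to inspect the data that the smoothing layer computes. The sketch below uses `layer_data()` from ggplot2, which returns the computed data for one layer; the object names are illustrative.

```r
library(ggplot2)
library(palmerpenguins)

# Global color mapping: inherited by every layer, so the smoother is fit per species
p_global <- ggplot(
  penguins,
  aes(x = flipper_length_mm, y = body_mass_g, color = species)
) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# Local color mapping: only the points are colored, so one smoother is fit overall
p_local <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(aes(color = species)) +
  geom_smooth(method = "lm", formula = y ~ x)

# Layer 2 is geom_smooth(); count how many groups it was fit on
groups_global <- length(unique(layer_data(p_global, 2)$group))
groups_local <- length(unique(layer_data(p_local, 2)$group))

c(global = groups_global, local = groups_local)
```

With the global mapping the smoother is computed once per species, while with the local mapping it is computed once for all penguins.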
It’s generally not a good idea to represent information using only colors on a plot, as people perceive colors differently due to color blindness or other color vision differences.
ggplot(
data = penguins,
mapping = aes(x = flipper_length_mm,
y = body_mass_g)
) +
geom_point(mapping = aes(color = species, shape = species)) +
geom_smooth(method = 'lm')
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values (`geom_point()`).
We can improve the labels of our plot using the labs()
function in a new layer.
The arguments are:
- title
- subtitle
- x
- y
- color and shape: define the label for the legend
scale_color_colorblind(): improves the color palette to be colorblind safe (from the ggthemes package)
ggplot(
data = penguins,
mapping = aes(x = flipper_length_mm,
y = body_mass_g)
) +
geom_point(mapping = aes(color = species, shape = species)) +
geom_smooth(method = 'lm') +
labs(
title = 'Body mass vs. Flipper length',
subtitle = 'Dimensions for Adelie, Chinstrap, and Gentoo Penguins',
x = 'Flipper length (mm)',
y = 'Body mass (g)',
color = 'Species',
shape = 'Species'
) +
scale_color_colorblind()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values (`geom_point()`).
How to visualize the distribution of a variable depends on the type of variable:
- Categorical
- Numerical
A variable is categorical if it can only take one of a small set of values. To examine the distribution of a categorical variable, we can use a bar chart.
ggplot(penguins, aes(x = species)) +
geom_bar()
In bar plots of categorical variables with non-ordered levels, it’s often preferable to reorder the bars based on their frequencies. This requires transforming the variable to a factor (if it isn’t one already) and then reordering the levels of that factor.
ggplot(penguins, aes(x = fct_infreq(species))) +
geom_bar()
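To see what fct_infreq() does to the factor itself, the level order can be compared before and after. This is a short illustrative check; `species_by_freq` is a name introduced here.

```r
library(forcats)
library(palmerpenguins)

# Original level order (alphabetical) and the count of each species
levels(penguins$species)
table(penguins$species)

# fct_infreq() reorders the levels from most to least frequent
species_by_freq <- fct_infreq(penguins$species)
levels(species_by_freq)
```

The values of the variable are unchanged; only the order of the levels, and therefore the order of the bars, is different.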
A variable is numerical or quantitative if it can take on a wide
range of numerical values. Numerical variables can be continuous or
discrete.
One commonly used visualization for the distribution of a continuous variable
is a histogram.
ggplot(penguins, aes(x = body_mass_g)) +
geom_histogram(binwidth = 200)
## Warning: Removed 2 rows containing non-finite values (`stat_bin()`).
A histogram divides the x-axis into equally spaced bins and then uses
the height of a bar to display the number of observations that fall in
each bin.
Since different binwidths can reveal different patterns, we should
explore a variety of binwidths when working with histograms.
ggplot(penguins, aes(x = body_mass_g)) +
geom_histogram(binwidth = 20)
## Warning: Removed 2 rows containing non-finite values (`stat_bin()`).
ggplot(penguins, aes(x = body_mass_g)) +
geom_histogram(binwidth = 2000)
## Warning: Removed 2 rows containing non-finite values (`stat_bin()`).
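A rough way to judge a binwidth before plotting is to estimate how many bins it produces over the range of the data. The calculation below is only an approximation under that assumption; the exact number of bars drawn by geom_histogram() can differ slightly because of how bins are aligned.

```r
library(palmerpenguins)

# Range of the variable, ignoring missing values
mass_range <- range(penguins$body_mass_g, na.rm = TRUE)

# Approximate number of bins for each binwidth tried above
binwidths <- c(20, 200, 2000)
approx_bins <- ceiling(diff(mass_range) / binwidths)
names(approx_bins) <- binwidths

approx_bins
```

A binwidth of 20 yields far more bins than there is structure to show, while 2000 collapses the data into only a few bars, which is why 200 reads best of the three.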
An alternative visualization for distributions of numerical variables
is a density plot. A density plot is a smoothed-out version
of a histogram. It shows fewer details than a histogram but can make it
easier to quickly glean the shape of the distribution, particularly with
respect to modes and skewness.
ggplot(penguins, aes(x = body_mass_g)) +
geom_density()
## Warning: Removed 2 rows containing non-finite values (`stat_density()`).
To visualize a relationship we need to have at least two variables.
To visualize the relationship between a numerical and a categorical
variable we can use side-by-side box plots.
ggplot(penguins, aes(x = species, y = body_mass_g)) +
geom_boxplot()
## Warning: Removed 2 rows containing non-finite values (`stat_boxplot()`).
Alternatively, we can make density plots with
geom_density().
ggplot(penguins, aes(x = body_mass_g, color = species)) +
geom_density()
## Warning: Removed 2 rows containing non-finite values (`stat_density()`).
ggplot(penguins, aes(x = body_mass_g, color = species, fill = species)) +
geom_density(linewidth = 2, alpha = 0.7)
## Warning: Removed 2 rows containing non-finite values (`stat_density()`).
We can use stacked bar plots to visualize the
relationship between two categorical variables.
ggplot(penguins, aes(x = island, fill = species)) +
geom_bar()
A second version of the plot, created by setting position = 'fill' in the geom, is a relative frequency plot. It is more useful for comparing species distributions across the islands since it’s not affected by the unequal numbers of penguins across the islands.
ggplot(penguins, aes(x = island, fill = species)) +
geom_bar(position = 'fill')
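The proportions that the relative frequency plot displays can also be computed directly. The following sketch uses dplyr; `island_props` and `prop` are names introduced for illustration.

```r
library(dplyr)
library(palmerpenguins)

# Share of each species within each island
island_props <- penguins |>
  count(island, species) |>
  group_by(island) |>
  mutate(prop = n / sum(n)) |>
  ungroup()

island_props
```

Within each island the `prop` values sum to 1, which is exactly what makes every bar in the filled plot the same height.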
To visualize the relationship between two numerical variables, we
can use scatterplots and smooth curves.
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
geom_point() +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 2 rows containing missing values (`geom_point()`).
We can incorporate more variables into a plot by mapping them to additional aesthetics.
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
geom_point(aes(color = species, shape = island))
## Warning: Removed 2 rows containing missing values (`geom_point()`).
However, adding too many aesthetic mappings to a plot makes it
cluttered and difficult to make sense of.
Another way is to split our plot into facets. To facet our
plot by a single variable, use facet_wrap().
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
geom_point(aes(color = species, shape = species)) +
facet_wrap(~island)
## Warning: Removed 2 rows containing missing values (`geom_point()`).
ggsave() will save the plot most recently created to
disk. If we don’t specify the width and height
they will be taken from the dimensions of the current plotting
device.
ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
geom_point()
## Warning: Removed 2 rows containing missing values (`geom_point()`).
# ggsave(filename = 'penguin-plot.png')
# ggsave(filename = 'penguin-plot.pdf')
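To make saved figures reproducible, the plot object and its dimensions can be passed to ggsave() explicitly instead of relying on the last plot and the current device. This is a minimal sketch; the 6 x 4 inch size, the 300 dpi resolution, and the temporary output path are arbitrary choices for illustration.

```r
library(ggplot2)
library(palmerpenguins)

p <- ggplot(penguins, aes(x = flipper_length_mm, y = body_mass_g)) +
  geom_point(na.rm = TRUE)

# Write to a temporary directory so the example leaves no files behind
out_file <- file.path(tempdir(), "penguin-plot.png")

ggsave(
  filename = out_file,
  plot = p,
  width = 6,
  height = 4,
  units = "in",
  dpi = 300
)

file.exists(out_file)
```

The file type is inferred from the extension of `filename`, so changing it to `.pdf` would write a PDF instead.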
It’s rare that we get the data in exactly the right form we need to make the graph we want. Often we’ll need to create some new variables or summaries. We may also want to rename variables or reorder the observations.
Goals
- dplyr package
- overview of all the key tools for transforming a data frame
- understand the pipe, which is an important tool when combining verbs
library(nycflights13)
library(tidyverse)
To explore the basic dplyr verbs, we’re going to use
nycflights13::flights.
flights
flights is a tibble, a special type of data frame used
by the tidyverse. The most important difference between tibbles and data
frames is the way tibbles print. They are designed for large datasets,
so they only show the first few rows and only the columns that fit on
one screen.
glimpse(flights)
## Rows: 336,776
## Columns: 19
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
## $ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
## $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
## $ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
## $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
## $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
## $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
## $ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
## $ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
## $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
## $ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
## $ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…
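Common rules of dplyr
The pipe operator |> passes the value on its left as the first argument of the function on its right:
- x |> f(y) is equivalent to f(x, y)
- x |> f(y) |> g(z) is equivalent to g(f(x, y), z)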
flights |>
filter(dest == 'IAH') |>
group_by(year, month, day) |>
summarize(
arr_delay = mean(arr_delay, na.rm = TRUE)
)
## `summarise()` has grouped output by 'year', 'month'. You can override using the
## `.groups` argument.
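The equivalence between piped and nested calls can be checked with base R functions alone. The example below is a small sketch with made-up numbers; the native pipe requires R 4.1 or later.

```r
x <- c(1, 4, 9)

# x |> f() |> g() is the same as g(f(x))
piped <- x |> sqrt() |> sum()
nested <- sum(sqrt(x))

# x |> f(y) is the same as f(x, y): the left-hand side becomes the first argument
piped_round <- 3.14159 |> round(digits = 2)
nested_round <- round(3.14159, digits = 2)

identical(piped, nested)
identical(piped_round, nested_round)
```

The piped form reads left to right in the order the steps happen, which is the main reason it is preferred when several verbs are chained.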
dplyr’s verbs are organized into four groups based on what they operate on: rows, columns, groups, or tables.
The most important verbs that operate on rows of a dataset are:
- filter()
- arrange()
- distinct()
filter() allows us to keep rows based on the values of
the columns. When we run filter(), dplyr executes the
filtering operation, creating a new data frame. It doesn’t modify the
existing dataset. So if we want to save the result, we must use the
assignment operator <-.
arguments are: the data frame first, then one or more conditions that must be true to keep a row.
# departed more than 120 minutes late
flights |>
filter(dep_delay > 120)
We can also use <, <=, >, >=, ==, and !=, and combine
conditions with & (and) or | (or). There is a useful shortcut when
we are combining | and ==: %in%.
# flights that departed on January 1
flights |>
filter(month == 1 & day == 1)
# flights that departed in January or February
flights |>
filter(month == 1 | month == 2)
flights |>
filter(month %in% c(1, 2))
jan1 <- flights |>
filter(month == 1 & day == 1)
jan1
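Two of the points above can be verified directly: filter() leaves the original data frame untouched, and %in% gives the same result as chaining == with |. The object names in this sketch are illustrative.

```r
library(dplyr)
library(nycflights13)

n_before <- nrow(flights)

# Two ways of writing the same condition
jan_feb_or <- flights |> filter(month == 1 | month == 2)
jan_feb_in <- flights |> filter(month %in% c(1, 2))

# flights itself has not changed; only the new objects hold the filtered rows
n_after <- nrow(flights)

same_result <- identical(jan_feb_or, jan_feb_in)
c(n_before = n_before, n_after = n_after, n_filtered = nrow(jan_feb_in))
same_result
```

A frequent mistake is writing `month = 1` instead of `month == 1`; dplyr stops with an error in that case rather than silently filtering on the wrong thing.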
arrange() changes the order of the rows based on the
values of the columns. If we provide more than one column name, each
additional column will be used to break ties in the values of the preceding
columns. Ascending order is the default; to order a column in descending
order, wrap it in desc().
arguments are: the data frame first, then the column names (or expressions) to order by.
# sort by earliest departure first
flights |>
arrange(year, month, day, dep_time)
# sort by longest departure delay first
flights |>
arrange(desc(dep_delay))
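Tie-breaking is easier to see on a tiny table than on the full flights data. The tibble `toy` below is hypothetical data made up for this sketch.

```r
library(dplyr)
library(tibble)

# Hypothetical data with ties in the carrier column
toy <- tribble(
  ~carrier, ~dep_delay,
  "UA",      5,
  "AA",     20,
  "UA",     20,
  "AA",     -3
)

# Sort by carrier first; within each carrier, longest delay first
by_carrier_then_delay <- toy |>
  arrange(carrier, desc(dep_delay))

by_carrier_then_delay
```

The second column only matters where the first column has equal values, so the two AA rows come first, ordered by decreasing delay, followed by the two UA rows.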
distinct() finds all the unique rows in a dataset.
However, most of the time, we’ll want the distinct combinations of some
variables, so we can also optionally supply column names. If we want to
keep the other columns when filtering for unique rows, we can use the
.keep_all = TRUE option.
# remove duplicate rows
flights |>
distinct()
# find all unique origin and destination pairs
flights |>
distinct(origin, dest)
flights |>
distinct(origin, dest, .keep_all = TRUE)
# count(): find the number of occurrences
# sort = TRUE: arrange them in descending order of number of occurrences
flights |>
count(origin, dest, sort = TRUE)
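The difference between distinct() with and without .keep_all, and how count() relates to both, can be sketched on a small hypothetical table. The tibble `toy` and its values are made up for illustration.

```r
library(dplyr)
library(tibble)

# Hypothetical data: the JFK-MIA pair appears three times
toy <- tribble(
  ~origin, ~dest, ~flight,
  "JFK",   "MIA", 1,
  "JFK",   "MIA", 2,
  "LGA",   "ATL", 3,
  "JFK",   "ATL", 4,
  "JFK",   "MIA", 5
)

# Only the two named columns are returned
pairs <- toy |> distinct(origin, dest)

# All columns are returned, taken from the first row of each pair
first_rows <- toy |> distinct(origin, dest, .keep_all = TRUE)

# Number of rows per pair, most frequent first
pair_counts <- toy |> count(origin, dest, sort = TRUE)

pairs
first_rows
pair_counts
```

With .keep_all = TRUE, distinct() keeps the first occurrence of each combination and discards the rest, so the result depends on the row order of the input.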
There are four important verbs that affect the columns:
- mutate()
- select()
- rename()
- relocate()
The job of mutate() is to add new columns that are
calculated from the existing columns.
By default, mutate() adds new columns on the right-hand
side of our dataset. The .before argument adds the new variables to
the left-hand side instead (for example, .before = 1). We can also use the
.after argument, and in both .before and .after we can use a variable
name instead of a position.
Alternatively, we can control which variables are kept with the
.keep argument.
flights |>
mutate(
gain = dep_delay - arr_delay,
speed = distance / air_time * 60
)
flights |> mutate(
gain = dep_delay - arr_delay,
speed = distance / air_time * 60,
.before = 1
)
flights |>
mutate(
gain = dep_delay - arr_delay,
speed = distance / air_time * 60,
.after = day
)
flights |>
mutate(
gain = dep_delay - arr_delay,
hours = air_time / 60,
gain_per_hour = gain / hours,
.keep = 'used'
)
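The effect of the different .keep values is easiest to compare through the resulting column names. The tibble `toy` below is a hypothetical three-column table used only for this sketch.

```r
library(dplyr)
library(tibble)

toy <- tibble(a = 1:3, b = 4:6, c = 7:9)

# Default: keep every existing column and add the new one
keep_all <- toy |> mutate(d = a + b)

# "used": keep only the columns used in the calculation, plus the new one
keep_used <- toy |> mutate(d = a + b, .keep = "used")

# "unused": keep only the columns not used in the calculation, plus the new one
keep_unused <- toy |> mutate(d = a + b, .keep = "unused")

# "none": keep only the new column
keep_none <- toy |> mutate(d = a + b, .keep = "none")

names(keep_all)
names(keep_used)
names(keep_unused)
names(keep_none)
```

The 'used' option is handy for checking a calculation, since inputs and outputs end up side by side.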
select() allows us to rapidly zoom in on a useful subset
using operations based on the names of the variables.
flights |>
select(year, month, day)
flights |>
select(year:day)
# can also use - instead of !
flights |>
select(!year:day)
flights |>
select(where(is.character))
There are a number of helper functions we can use within
select():
- starts_with()
- ends_with()
- contains()
- num_range('x', 1:3)
We can rename variables using =
flights |>
select(tail_num = tailnum)
flights |>
rename(tail_num = tailnum)
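The selection helpers and the two ways of renaming can be compared on a small table. The one-row tibble `toy` and its column names are hypothetical, chosen so that each helper matches something.

```r
library(dplyr)
library(tibble)

toy <- tibble(
  x1 = 1, x2 = 2, x3 = 3,
  arr_time = 830, dep_time = 517,
  tailnum = "N14228"
)

# Selection helpers match columns by name pattern
arr_cols <- names(select(toy, starts_with("arr")))
time_cols <- names(select(toy, ends_with("time")))
underscore_cols <- names(select(toy, contains("_")))
x_cols <- names(select(toy, num_range("x", 1:2)))

# Renaming inside select() keeps only the selected column;
# rename() keeps every column and changes just the name
select_renamed <- names(select(toy, tail_num = tailnum))
rename_renamed <- names(rename(toy, tail_num = tailnum))

arr_cols
time_cols
underscore_cols
x_cols
select_renamed
rename_renamed
```

In both renaming forms the new name goes on the left of = and the existing name on the right.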
Use relocate() to move variables around. By default
relocate() moves variables to the front. We can also
specify where to put them using .before and
.after arguments just like in mutate().
flights |>
relocate(time_hour, air_time)
flights |>
relocate(year:dep_time, .after = time_hour)
flights |>
relocate(starts_with('arr'), .before = dep_time)
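The same three relocate() patterns can be traced on a small table where the resulting column order is easy to read. The tibble `toy` is hypothetical and only mimics a few flights columns.

```r
library(dplyr)
library(tibble)

toy <- tibble(year = 2013, month = 1, day = 1, dep_time = 517, air_time = 227)

# Default: move the named column to the front
to_front <- names(relocate(toy, air_time))

# Move a range of columns after another column
after_air_time <- names(relocate(toy, year:day, .after = air_time))

# Move columns matched by a helper before another column
before_month <- names(relocate(toy, starts_with("dep"), .before = month))

to_front
after_air_time
before_month
```

relocate() never adds or drops columns; it only changes their order.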